Overview and Motivation

Sabermetrics has become quite popular, especially in more recent years. Researchers have developed metrics that are widely used to gauge player performance. The search for more reliable and accurate measures is still ongoing. Specifically, teams are interested in buying wins. We will begin our search by analyzing an increasingly popular but not officially recognized metric: wins above replacement (WAR), which attempts to be a single metric which directly quantifies how many wins a player is worth to a team. However, as a statistic not officially recognized by Major League Baseball, there is no universal way of calculating WAR, with the three most popular versions coming from Baseball Reference, Baseball Prospectus, and Fan Graph But how well do these three versions compare in predicting true wins for a team from team total WAR? We address this question, and provde some further analysis. In doing so, we also aim to derive a more “comprehensive” metric, one that is a combination of the three to improve results.

Initial Questions

Some initial questions we wish to answer include: - What is the relationship between WAR and the number of wins a team sees? - How accurately does WAR predict the number of wins a team observes in a given year? - Are there any systematic issues with WAR in predicting true wins? - How can the three different ways to calculate WAR be combined to produce a more reliable statistic? - What are some efficient techniques that one could use to scrape data from Fangraphs, Baseball Prospectus, and Baseball Reference, all of which are commonly visited websites in the baseball world?

Data

WAR, again a non-standard metric, is based on a number of factors including batting runs, baserunning runs, runs added or lost due to grounding into double plays in double play situations, fielding runs, positional adjustment, replacement level, and even stadium adjustments. Each of our sources do this in a slightly different way.

Our data was obtained from the Teams dataset in the Lahman package (there is no WAR statistic in Lahman) and from three external sources: Baseball Prospectus, Baseball Reference, and Fangraphs. Due to the different formats of the three sources, the issues encountered and the techniques to solve them (detailed below), data scraping proved a large effort, and still required extensive cleaning to arrive at a complete and usable dataset.

Baseball Prospectus

Baseball Prospectus was the least challenging of the three for data wrangling. All players for each season were displayed on a single page with their respective team and BWARP (Baseball Prospectus’s version of WAR) annually for each year from 1871 to 2017 as can be seen in the image below. To scrape year-specific data, we vectorized over the range of years and used the function set_values and submit_form to automate the selection process and extracted the table of interest by identifying the corresponding html_node. The code has been provided below (though we recommend caution in running any of our data wrangling code blocks as they can take on the order of hours to run; we wrote all of our wrangled data to csv files for much faster data access in the later analyses).

url_batting <- "http://legacy.baseballprospectus.com/sortable/index.php?cid=2022181"
url_pitching <- "http://legacy.baseballprospectus.com/sortable/index.php?cid=1931167"

prospectus <- function(year, url){
  year = as.character(year)
  pgsession <- html_session(url)
  pgform <- html_form(pgsession)[[3]]
  
  #Fill in selection form
  filled_form <-set_values(pgform,
                           "year" = year
  )
  
  #Submit selection form
  d <- submit_form(session = pgsession, form = filled_form)
  dat <- d %>%
    html_nodes("table") %>%
    .[[5]] %>%
    html_table(header = Truth)
  dat
}

#Submit Year = X for X in [1871, 2017] to obtain year-specific data
prospectus_batting <- do.call(rbind, lapply(1871:2017, prospectus, url = url_batting))
prospectus_pitching <- do.call(rbind, lapply(1871:2017, prospectus, url = url_pitching))

#Reformat data 
prospectus_pitching <- read.csv("prospectus_pitching_full.csv")
prospectus_batting <- read.csv("prospectus_batting_full.csv")

prospectus_pitching_small <- prospectus_pitching[, c("NAME", "TEAM", "YEAR", "PWARP")]
names(prospectus_pitching_small) <- c("Name", "Team", "Year", "WARP")
prospectus_batting_small <- prospectus_batting[, c("NAME", "TEAM", "YEAR", "BWARP")]
names(prospectus_batting_small) <- c("Name", "Team", "Year", "WARP")

Baseball Reference

The data on Baseball Reference are grouped by team and years. Separate tables are displayed for batters and pitchers. The screenshot below provides an example of one of the tables shown for the New York Yankees in 2017.

Links to these tables follow a general pattern, so a vector of all the possible links was created. We then looped through all of these links to grab the tables from each page. Final touches were made to get the datasets into the right format (agin, the computation time on the was very large).

teams <- unique(Teams$franchID)
years <- 1871:2017

urls <- matrix(0, length(teams), length(years))
for(i in 1:length(teams)) {
  for(j in 1:length(years)) {
    urls[i, j] <- paste0("https://www.baseball-reference.com/teams/", teams[i], "/", years[j], ".shtml")
  }
}
url_vector <- as.vector(urls)

list_of_batting <- list()
list_of_pitching <- list()
for(i in 1:5000) {
  url <- url_vector[i]
  
  res <- try(readLines(url), silent = TRUE)
  
  ## check if website exists
  if(inherits(res, "try-error")) {
    list_of_batting[[i]] <- NA
    list_of_pitching[[i]] <- NA
  }
  else {
    urltxt <- readLines(url)
    urltxt <- gsub("-->", "", gsub("<!--", "", urltxt))
    doc <- htmlParse(urltxt)
    tables_full <- readHTMLTable(doc)
    tmp1 <- tables_full$players_value_batting
    tmp2 <- tables_full$players_value_pitching
    list_of_batting[[i]] <- tmp1
    list_of_pitching[[i]] <- tmp2
  }
  print(i)
  closeAllConnections()
}

for(i in 5001:10000) {
  url <- url_vector[i]
  
  res <- try(readLines(url), silent = TRUE)
  
  ## check if website exists
  if(inherits(res, "try-error")) {
    list_of_batting[[i]] <- NA
    list_of_pitching[[i]] <- NA
  }
  else {
    urltxt <- readLines(url)
    urltxt <- gsub("-->", "", gsub("<!--", "", urltxt))
    doc <- htmlParse(urltxt)
    tables_full <- readHTMLTable(doc)
    tmp1 <- tables_full$players_value_batting
    tmp2 <- tables_full$players_value_pitching
    list_of_batting[[i]] <- tmp1
    list_of_pitching[[i]] <- tmp2
  }
  print(i)
  closeAllConnections()
}

for(i in 10001:17640) {
  url <- url_vector[i]
  
  res <- try(readLines(url), silent = TRUE)
  
  ## check if website exists
  if(inherits(res, "try-error")) {
    list_of_batting[[i]] <- NA
    list_of_pitching[[i]] <- NA
  }
  else {
    urltxt <- readLines(url)
    urltxt <- gsub("-->", "", gsub("<!--", "", urltxt))
    doc <- htmlParse(urltxt)
    tables_full <- readHTMLTable(doc)
    tmp1 <- tables_full$players_value_batting
    tmp2 <- tables_full$players_value_pitching
    list_of_batting[[i]] <- tmp1
    list_of_pitching[[i]] <- tmp2
  }
  print(i)
  closeAllConnections()
}

## find indices where the link exists
ind_batting <- which(!is.na(list_of_batting))
ind_pitching <- which(!is.na(list_of_pitching))

## find links that exist
url_batting <- url_vector[ind_batting]
url_pitching <- url_vector[ind_pitching]

## extract year from each url
years_batting <- as.numeric(unlist(regmatches(url_batting, gregexpr("[[:digit:]]+", url_batting))))
years_pitching <- as.numeric(unlist(regmatches(url_pitching, gregexpr("[[:digit:]]+", url_pitching))))

## extract team from each url
teams_batting <- basename(dirname(url_batting))
teams_pitching <- basename(dirname(url_pitching))

## remove NAs from lists
na.omit.list <- function(y) { 
  return(y[!sapply(y, function(x) all(is.na(x)))]) 
}

test_batting <- na.omit.list(list_of_batting)
test_pitching <- na.omit.list(list_of_pitching)

## add columns for year and team
test_batting <- mapply(cbind, test_batting, "Year" = years_batting, 
                       "Team" = teams_batting, SIMPLIFY = F)
test_pitching <- mapply(cbind, test_pitching, "Year" = years_pitching,
                        "Team" = teams_pitching, SIMPLIFY = F)

bbref_batting <- bind_rows(test_batting)
bbref_pitching <- bind_rows(test_pitching)

Fan Graphs

On Fan Graph, data is available for each player for each of the current 30 MLB teams as far back as their respective foundings. The URLs for these pages were in a standard format though, like Baseball Prospectus, there is a different page for each team for offensive players (i.e. hitters) and pitchers. A challenge unique to the Fan Graph data scraping was that the pages were generated via JavaScript, so simply connecting to the site with R to get the HTML for the pages would not actually include the required data, rather just the format of the page. To remedy this, a short PhantomJS script was written to initiate the JavaScript on the Fan Graph site. From within R, we can edit this script (specifically the var URL) to get all players (both pitchers and hitters) and all teams. Then, we simply parse the HTML code, get the WAR values for each player, sum them together for each team for each year, reformat the data, and finally save it (team, year, total WAR) to a csv file which we can load for our analysis.

var url ='https://www.fangraphs.com/leaders.aspx?pos=all&stats=pit&lg=all&qual=0&type=8&season=2017&month=0&season1=2017&ind=0&team=30&rost=0&age=0';
var page = new WebPage()
var fs = require('fs');


page.open(url, function (status) {
        just_wait();
});

function just_wait() {
    setTimeout(function() {
               fs.write('1.html', page.content, 'w');
            phantom.exit();
    }, 2500);
}
get_WAR_from_team_year=function(team, year)
{   
    
    #Batters first
    prep_jc_url(team, year, 1)
    system("phantomjs scrap_final.js")
    myHtml <- readLines("1.html")
    myHtml_sub = myHtml[ grep('LeaderBoard1_dg1_ctl00', myHtml)[1]: grep('LeaderBoard1_dg1_SharedCalendarContainer', myHtml)]
    myHtml_sub = myHtml_sub [ grep('playerid=', myHtml_sub) ]
    tmp_o=lapply(myHtml_sub, get_WAR_from_PIDrow, batting=1)
    
    #then pitchers
    prep_jc_url(team, year, 0)
    system("phantomjs scrap_final.js")
    myHtml <- readLines("1.html")
    myHtml_sub = myHtml[ grep('LeaderBoard1_dg1_ctl00', myHtml)[1]: grep('LeaderBoard1_dg1_SharedCalendarContainer', myHtml)]
    myHtml_sub = myHtml_sub [ grep('playerid=', myHtml_sub) ]
    tmp_p=lapply(myHtml_sub, get_WAR_from_PIDrow, batting=0)
    
    if(length(tmp_p) > 0)
    {
    
        tmp_o_mat = t(matrix(unlist(tmp_o), nrow  =2))
        tmp_p_mat = t(matrix(unlist(tmp_p), nrow  =2))
        
        
        #could return list, but why not just sum now and return sum? 
        df_war = data.frame(rbind(tmp_o_mat, tmp_p_mat))
        #return(df_war)
        ret = sum(as.numeric(as.character(df_war[,1])))
        return(ret)
    }
    else{return(0)}
}


prep_jc_url=function (team, year, batting)
{
     url = paste0('https://www.fangraphs.com/leaders.aspx?pos=all&stats=', ifelse(batting==1, "bat", "pit"),'&lg=all&qual=0&type=8&season=', year, '&month=0&season1=', year, '&ind=0&team=', team, '&rost=0&age=0')
     lines <- readLines("scrap_final.js")
     lines[1] <- paste0("var url ='", url ,"';")
     writeLines(lines, "scrap_final.js")
}



get_WAR_from_PIDrow = function(string, batting)
{
    #WAR
    tmp = gregexpr('<td class="grid_line_regular" align="right">', string)
    start = tmp[[1]][(length(tmp[[1]]))]
    tmp_string = substr(string, start=start, stop = start+100)
    WAR = as.numeric(split_string_on_gele(tmp_string)[ifelse(batting==1, 3, 7)])
    
    #Name
    tmp = gregexpr('playerid', string)
    start = tmp[[1]][(length(tmp[[1]]))]
    tmp_string = substr(string, start=start, stop = start+100)
    name = split_string_on_gele(tmp_string)[2]
    
    return(c(WAR, name))
}

split_string_on_gele = function(string)
{
    matrix(strsplit(string, "[><]"))[[1]]
}



FanGraphWar = matrix(nrow=30, ncol= 115)


start.time <- Sys.time()
for (team in 1:30)
{
    for(year in 1:115)
    {
        reportedTeamWAR = get_WAR_from_team_year(team, year+1902)
        FanGraphWar[team, year] = reportedTeamWAR
        print(paste(reportedTeamWAR, team, year))
    }
    
}
end.time <- Sys.time()
time.taken <- end.time - start.time
time.taken

FGW = as.data.frame(t(FanGraphWar))

#label with franchID's 
FGW$ANA = FGW$V1 
FGW$BAL = FGW$V2 
FGW$BOS = FGW$V3 
FGW$CHW = FGW$V4 
FGW$CLE = FGW$V5 
FGW$DET = FGW$V6
FGW$KCR = FGW$V7 
FGW$MIN = FGW$V8 
FGW$NYY = FGW$V9 
FGW$OAK = FGW$V10 
FGW$SEA = FGW$V11 
FGW$TBD = FGW$V12 
FGW$TEX = FGW$V13 
FGW$TOR = FGW$V14 
FGW$ARI = FGW$V15 
FGW$ATL = FGW$V16 
FGW$CHC = FGW$V17 
FGW$CIN = FGW$V18 
FGW$COL = FGW$V19 
FGW$FLA = FGW$V20 
FGW$HOU = FGW$V21 
FGW$LAD = FGW$V22 
FGW$MIL = FGW$V23
FGW$WSN = FGW$V24 
FGW$NYM = FGW$V25 
FGW$PHI = FGW$V26
FGW$PIT = FGW$V27 
FGW$STL = FGW$V28 
FGW$SDP = FGW$V29 
FGW$SFG = FGW$V30

#make sure we only have the 30 current teams
library(Lahman)
keeps = tail(Teams$franchID, n = 30)
FGW = FGW[, (names(FGW) %in% keeps)]

#transform it to a better format

FanGraphWARs = df <- data.frame(franchID=character(), year=integer(), WAR=double(), stringsAsFactors=FALSE) 


for (team in colnames(FGW))
{
    zero_marker = 1
    for(year_idx in 1:115)
    {
        tmp = as.data.frame(x=list(franchID = team, year = year_idx + 1902, WAR = FGW[year_idx, team]))
        
        if (zero_marker == 1)
        {
            if (tmp$WAR == 0.0)
            {
                #do nothing, before team was created 
            }
            else
            {
                zero_marker = 0 #we know the franchise has been created now
                FanGraphWARs = rbind(FanGraphWARs , tmp)
            }
        }
        else
        {
            FanGraphWARs = rbind(FanGraphWARs , tmp)
        }
    }
}

write.csv(FanGraphWARs, "FanGraphWARs_better_format.csv")

Combine data sets

After getting our three datasets into csv format, the next step was to combine them into on usable data frame for our analysis. Notably, Baseball Prospectus’s data only went as far back as 1949, so we opted to only include data from 1949 from all three sources in our analyses. An additional point of note for those unfamiliar with the WAR statistic, the statistic takes into account how many wins a team of replacement players would get over the course of a season (and a player’s WAR is meant to be a prediction of how many additional wins a player provides in place of this replacement player). In other words, our calculated team total WARs are not in fact the predicted number of wins, but the predicted number of wins over a replacement team. For Baseball Reference and Fan Graph, a team of replacement players is expected to win 0.294 of its games, while Baseball Prospectus sets this line at 0.320. To get predicted wins, we multiply the number of games a team played in a season (acquired from Lahman) by these replacement win percentages, then added the team total WAR.

#Read in BBProspectus data and remove extraneous columns
BBProspectus <- read.csv("prospectus.csv")
BBProspectus <- BBProspectus %>%
    rename(BBproWAR = WARP, teamID = franchID) %>%
    select(c(-X, -W))

#Baseball Prospectus had some funny names, so first we need to align them with our other datasets using Lahman
BBProspectus$franchID = NA

BBProspectus = BBProspectus %>%
    filter(yearID >= 1949)

for (i in 1:length(BBProspectus[, 1]))
{
    
    BBProspectus[i, ]$franchID = 
        as.character(unique(
            Teams$franchID[as.character(Teams$teamID) == as.character(BBProspectus[i, ]$teamID)])[1])
}    

BBProspectus = BBProspectus %>%
    select(c(-teamID))

#Read in BBProspectus data and remove extraneous columns
BBReference <- read.csv("BBReference.csv")
BBReference <- BBReference %>%
    filter(yearID >= 1949) %>%
    rename(BRefWAR = WAR) %>%
    select(c(-X, -source, -W))

BBReference = BBReference[!is.na(BBReference$BRefWAR), ]
#Baseball Reference Team IDs are based on the team of that year, need to correct this for a few teams
alts = c("LAA", "CAL", "TBR", "MIA", "PHA")
delt = c("ANA", "ANA", "TBD", "FLA", "OAK")

for (i in 1:length(BBReference[, 1]))
{
    if ((BBReference[i, ]$franchID %in% alts))
    {
        BBReference[i, ]$franchID =  delt[match(BBReference[i, ]$franchID, alts)]
    }
}

FanGraphs <- read.csv("Conor/FanGraphWARs_better_format.csv")
FanGraphs <- FanGraphs %>%
    rename(FanGraphWAR = WAR, yearID=year) %>%
    select(c(-X))

#Inner join the three for a complete data set
data <- inner_join(BBReference, BBProspectus, by = c("franchID", "yearID"))
data = inner_join(data, FanGraphs, by =c("franchID", "yearID"))

#Join combined data with true number of wins in Lahman's dataset to create wide data set 
team_sub <- Teams %>%
    select(yearID, franchID, W, G, lgID)  %>%
    rename(Truth = W, Games = G)
data_wide <- left_join(data, team_sub, by = c("franchID", "yearID"))

data_wide = data_wide %>%
    filter(yearID %in% 1949:2016) %>%
    mutate(BRefWAR = BRefWAR + Games * 0.294, FanGraphWAR = FanGraphWAR + Games * 0.294, BBproWAR = BBproWAR + Games*0.320) %>%
    rename(BRefPred = BRefWAR, FanGraphPred = FanGraphWAR, BBproPred = BBproWAR)

data_long <- gather(data_wide, source, WAR, -c(franchID, yearID))

Exploratory Analysis

What visualizations did you use to look at your data in different ways? What are the different statistical methods you considered? Justify the decisions you made, and show any major changes to your ideas. How did you reach these conclusions?

How well does WAR compare to the true number of wins?

Before doing any statistical analysis, as a sanity check and easy visualization of the data we created box plots of the residuals (predicted - true wins) for each team and each WAR statistic. Ideally, we would hope for these residuals to be centered at 0, with little variance (i.e. short boxes) and little to no outliers. As can be seen in the plot below, our results for Fan Graph and Baseball Reference both fit these criteria. However, Baseball Prospectus had much more variability in its residuals, having larger residuals (i.e. longer boxes), more outliers, and was less centered around 0. Additionally, for Fan Graph and Baseball Reference there are no obvious team effects.

data_long <- gather(data_wide, Pred, Val, -c(franchID, yearID, Truth, Games, lgID))
data_long2 <- data_long %>%
  mutate(Residuals = (as.numeric(as.character(Val)) - as.numeric(as.character(Truth))))

gg2 <- data_long2 %>%
  ggplot(aes(x = franchID, y = Residuals)) +
  geom_boxplot(fill = "lightblue") +
  facet_wrap(~ Pred, ncol = 1) +
  theme(axis.text.x = element_text(angle = 90)) +
  xlab("Team Name") +
  ggtitle("")
gg2

First, in our more formal analysis, we wanted to assess at the most basic level which predictor performed the best. We defined our assessment in terms of the mean square error (MSE) between the true wins a team had during a season and the wins predicted by each of the three algorithms. In terms of MSE, we found that the Baseball Reference performed the best (MSE = 5.13), then Fan Graph (MSE = 6.23), and last was Baseball Prospectus (MSE = 9.25), which matched our initial assessment from our box plot. Additionally, we also wanted to assess how well the different WAR statistics correlated with true wins. Ideally, we would expect slopes of 1 (i.e. each 1 additional WAR corresponds to 1 additional team win). In terms of this, Fan Graph had the best model, with a slope of 1.002, followed closely by Baseball Reference 0.986, and lastly —far behind the other two— was Baseball Prospectus at 0.692.

In visualizing the differences between our predictors and the true wins, we produced the below plot, histograms of the differences between prediction and truth. All three appear to be normally distributed. Baseball Reference appears to have the most mass around zero, followed closely by Fan Graph (though Fan Graph seems to have a lot more mass between 5 and 15 than Baseball Reference). Again, Baseball Prospectus appeared to be the worst, having much less mass around zero, and more values near -20 and 20 than both of the other two.

In looking at the means differences, we see that Baseball Reference is close to 0 (as we would expect). However, both Baseball Prospectus and Fan Graphs have means above 2, and indication of possible systematic error (and a possible explanation of why Fan Graph had all that extra mass between 5 and 15). In looking at the standard deviations, we continue to see the trend of Baseball Prospectus being the worst predictor of the three, while Baseball Reference and Fan Graphs have similar standard deviations, with Baseball Reference being slightly better, as we would have expected from our graph.

statistics_matrix = matrix(nrow = 3, ncol = 4)
rownames(statistics_matrix) = c("Fan Graph", "Baseball Prospectus", "Baseball Reference")
colnames(statistics_matrix) = c("MSE", "Mean Difference", "Standard Deviation", "Slope")

FG_diff = data_wide$FanGraphPred - data_wide$Truth
statistics_matrix[1,1] = MSE(data_wide$Truth, data_wide$FanGraphPred)
statistics_matrix[1,2] = mean(FG_diff)
statistics_matrix[1,3] = sd(FG_diff)
statistics_matrix[1,4] = lm(Truth ~ FanGraphPred, data = data_wide)$coef[[2]]

BBP_diff = data_wide$BBproPred - data_wide$Truth
statistics_matrix[2,1] = MSE(data_wide$Truth, data_wide$BBproPred)
statistics_matrix[2,2] = mean(BBP_diff)
statistics_matrix[2,3] = sd(BBP_diff)
statistics_matrix[2,4] = lm(Truth ~ BBproPred, data = data_wide)$coef[[2]]

BRef_diff = data_wide$BRefPred - data_wide$Truth
statistics_matrix[3,1] = MSE(data_wide$Truth, data_wide$BRefPred)
statistics_matrix[3,2] = mean(BRef_diff)
statistics_matrix[3,3] = sd(BRef_diff)
statistics_matrix[3,4] = lm(Truth ~ BRefPred, data = data_wide)$coef[[2]]

kable(statistics_matrix)
MSE Mean Difference Standard Deviation Slope
Fan Graph 6.239231 2.5455374 5.698188 1.0017252
Baseball Prospectus 9.256546 2.8262118 8.817408 0.6918975
Baseball Reference 5.136446 0.2340364 5.132780 0.9863421
diff2 = cbind(FG_diff, BBP_diff, BRef_diff)
diff2 = as.data.frame(diff2)

diff2 %>% ggplot() + 
    geom_histogram(aes(x=FG_diff, fill="Fan Graph"), alpha = 0.2, breaks = seq(-20, 20, 0.5)) +
    geom_histogram(aes(x=BBP_diff, fill="Baseball Prospectus"), alpha = 0.2, breaks = seq(-20, 20, 0.5)) +
    geom_histogram(aes(x=BRef_diff, fill="Baseball Reference"), alpha = 0.2, breaks = seq(-20, 20, 0.5)) +
    ggtitle("Difference Between Predicted and True Wins") +
    xlab("Difference") +
    ylab("Count") +
    scale_fill_manual(values = c("Fan Graph" = rgb(1,0,0), "Baseball Prospectus" = rgb(0,1,0), "Baseball Reference"=rgb(0,0,1)))

Due to the number of baseline wins assumed by the predictors, we were concerned that the predictions might have suffered at the extremes (i.e. when teams win less than ~55 games or more than ~100). To visually assess this, we plotted the true wins against the predicted wins for each predictor. In this graph, we see Baseball Prospectus had a tendency to greatly overestimate the wins of poorly performing teams (in one case predicting close to 90 wins for a 50 win team!). It also had an apparent trend of overpredicting wins throughout its range, again a suggestion of systematic error. Both Baseball Reference and Fan Graph also showed some signs of overpredicting wins for bad teams and underestimating them for great teams, but their differences were less extreme. Additionally, the middle range of both of these predictors performed very well. The smaller residuals in these two predictors was also apparent when all three were plotted together.

data_wide %>% ggplot(aes(x = Truth, y = BBproPred)) +  
    geom_point(color="green", size = 1) +
    geom_abline(slope = 1, intercept = 0) +
    ggtitle("Baseball Prospectus True v Predicted Wins") +
    xlab("True Wins") +
    ylab("Predicted Wins") +
    xlim(40, 125) +
    ylim(40, 125)

data_wide %>% ggplot(aes(x = Truth, y = FanGraphPred)) +  
    geom_point(color="red", size = 1) +
    geom_abline(slope = 1, intercept = 0) +
    ggtitle("Fan Graph True v Predicted Wins") +
    xlab("True Wins") +
    ylab("Predicted Wins") +
    xlim(40, 125) +
    ylim(40, 125)

data_wide %>% ggplot(aes(x = Truth, y = BRefPred)) +  
    geom_point(color="blue", size = 1) +
    geom_abline(slope = 1, intercept = 0) +
    ggtitle("Baseball Reference True v Predicted Wins") +
    xlab("True Wins") +
    ylab("Predicted Wins") +
    xlim(40, 125) +
    ylim(40, 125)

data_wide %>% ggplot() +
    geom_point(aes(x = Truth, y = FanGraphPred, color="Fan Graph"), alpha = 0.5) + 
    geom_point(aes(x = Truth, y = BBproPred, color="Baseball Prospectus"), alpha = 0.5) + 
    geom_point(aes(x = Truth, y = BRefPred, color="Baseball Reference"), alpha = 0.5) +
    scale_fill_manual(values = c("Fan Graph" = rgb(1,0,0), "Baseball Prospectus" = rgb(0,1,0), "Baseball Reference"=rgb(0,0,1))) +
    geom_abline(slope = 1, intercept = 0) +
    ggtitle("True v Predicted Wins") +
    xlab("True Wins") +
    ylab("Predicted Wins") +
    xlim(40, 125) +
    ylim(40, 125)

Based on our above analyses, it was quite clear that Fan Graph and Baseball Reference greatly outperformed Baseball Prospectus, with Baseball Reference having a slight edge over Fan Graph. However, in researching about the calculations of WAR, there were a few side analyses we also wanted to perform.

In 2013, Baseball Reference published an article about how it updated its WAR baseline wins; before, it had a different number of baseline wins for the American and National Leagues (AL was still 0.294 before, but the NL was changed from 0.320). The article also discusses how part of the reason for the change was that FanGraphs was already using 0.294 for both leagues. Based on this, we were interested in investigating if there was a difference in the performance, quantified by MSE, between American and National League teams for the predictors.

Next, Fan Graph has reported that it awards 1000 WAR per season, 570 to hitters and 430 to pitchers. Though this balance might be about right for today’s game, it might not make sense for other eras of the game. Specifically, one might expect the Deadball 2 Era (1960-1969) where pitchers dominated hitters (to the point where Carl Yastrzemski won the batting title in 1968 with a record low average of .301) and the Steroid Era (1985-2005) where hitters like Barry Bonds, Sammy Sosa, Mark Maguire, and Jose Canseco began hitting homeruns at rates never seen before to have different balances.

Additionally, in continuing our AL versus NL comparison, we examined the difference specifically during the Designated Hitter Era (1973-present) when there actually existed a fundamental difference between the two leagues (AL has DH, NL does not). Lastly, we also examined the Wildcard Era (1994-present) where an additional playoff spot was added, which is thought to make the game more competetive as more teams remain in the playoff hunt longer (and thus strive more for wins rather than just contract boosting stats).

league_splits = data_wide %>%
    group_by(lgID) %>%
    summarize("Fan Graph"=MSE(Truth, FanGraphPred), "Baseball Prospectus"=MSE(Truth, BBproPred),
              "Baseball Reference"=MSE(Truth, BRefPred) ) %>%
    rename("League" = lgID)



deadball2 = data_wide %>%
    filter(yearID %in% 1960:1969) %>%
    group_by(lgID) %>%
    summarize("Fan Graph"=MSE(Truth, FanGraphPred), "Baseball Prospectus"=MSE(Truth, BBproPred),
              "Baseball Reference"=MSE(Truth, BRefPred) ) %>%
    rename("League" = lgID)
deadball22 = data_wide %>%
    filter(yearID %in% 1960:1969) %>%
    summarize("Fan Graph"=MSE(Truth, FanGraphPred), "Baseball Prospectus"=MSE(Truth, BBproPred),
              "Baseball Reference"=MSE(Truth, BRefPred) )
deadball22 = cbind("League"="MLB", deadball22)
#steroid = cbind(Era="Steroid", steroid)
deadball2 = rbind(deadball2, deadball22)
#deadball2 = cbind(Era="Deadball", deadball2 )



steroid = data_wide %>%
    filter(yearID %in% 1985:2005) %>%
    group_by(lgID) %>%
    summarize("Fan Graph"=MSE(Truth, FanGraphPred), "Baseball Prospectus"=MSE(Truth, BBproPred),
              "Baseball Reference"=MSE(Truth, BRefPred) ) %>%
    rename("League" = lgID)

steroid1 = data_wide %>%
    filter(yearID %in% 1985:2005) %>%
    summarize("Fan Graph"=MSE(Truth, FanGraphPred), "Baseball Prospectus"=MSE(Truth, BBproPred),
              "Baseball Reference"=MSE(Truth, BRefPred) ) 
steroid1 = cbind("League"="MLB", steroid1)
steroid = rbind(steroid, steroid1)


dh = data_wide %>%
    filter(yearID %in% 1973:2016) %>%
    group_by(lgID) %>%
    summarize("Fan Graph"=MSE(Truth, FanGraphPred), "Baseball Prospectus"=MSE(Truth, BBproPred),
              "Baseball Reference"=MSE(Truth, BRefPred) ) %>%
    rename("League" = lgID)

dh1 = data_wide %>%
    filter(yearID %in% 1973:2016) %>%
    summarize("Fan Graph"=MSE(Truth, FanGraphPred), "Baseball Prospectus"=MSE(Truth, BBproPred),
              "Baseball Reference"=MSE(Truth, BRefPred) ) 

dh1 = cbind("League"="MLB", dh1)
dh = rbind(dh, dh1)






wildcard = data_wide %>%
    filter(yearID %in% 1994:2016) %>%
    group_by(lgID) %>%
    summarize("Fan Graph"=MSE(Truth, FanGraphPred), "Baseball Prospectus"=MSE(Truth, BBproPred),
              "Baseball Reference"=MSE(Truth, BRefPred) ) %>%
    rename("League" = lgID)

wildcard1 = data_wide %>%
    filter(yearID %in% 1994:2016) %>%
    summarize("Fan Graph"=MSE(Truth, FanGraphPred), "Baseball Prospectus"=MSE(Truth, BBproPred),
              "Baseball Reference"=MSE(Truth, BRefPred) ) 

wildcard1 = cbind("League"="MLB", wildcard1)
wildcard = rbind(wildcard, wildcard1)

Interestingly, there was a difference in all three algorithms, with the predicted wins for the American League teams having less error than the predicted wins for the National League teams. However, the differences were not so extreme as to change the rank of the predictors: the best predictor in terms of MSE was Baseball Reference’s AL predictions, followed by its NL, then Fan Graphs AL and NL respectively, and finally Baseball Prospectus’s AL and NL.

Analysis by League
League Fan Graph Baseball Prospectus Baseball Reference
AL 5.779660 8.666685 5.082375
NL 6.689836 9.840455 5.192867

As expected, Fan Graph performed less well during the Deadball 2 Era. However, so did the other two predictors, which actually had greater performance reduction than Fan Graph. Even more surprising, all three predictors saw performance improvements during the Steroid Era, with Baseball Reference and Baseball Propsectus having the greatest performance improvements. These results suggest that all three algorithms are weighted too heavily towards batters performance (moreso for Baseball Reference and Baseball Prospectus than Fan Graph) in terms of predicting true wins.

All three predictors performed better in the DH Era compared to the entirety of the dataset, though interestingly all three saw more improvement in predicting NL wins (where there was no rule change) than AL wins (where there was).This may be an artifact of the improvement in the Steroid Era (where we saw all three predictors perform better) and that the homerun kings —Barry Bonds, Mark Maguire, and Sammy Sosa— were all in playing in the National League.

In analyzing the Wildcard Era, we see that the MSE values for Fan Graph and Baseball Reference were nearly equal to their respective MSE values for the entire dataset: Fan Graph having MSE’s of 6.25 and 6.24 and Baseball Reference having MSE’s of 5.14 and 5.13 in the Wildcard and whole dataset, respectively. Baseball Prospectus had great improvement in the Wildcard Era, having an MSE of 7.90 compared to a 9.26 for the whole dataset. Despite its improvement in this subset, it still performed worse than both of the other predictors.

Deadball 2 Era
League Fan Graph Baseball Prospectus Baseball Reference
AL 6.031860 9.406519 5.788779
NL 7.430569 12.531146 6.059422
MLB 6.825169 11.207687 5.936909
Steroid Era
League Fan Graph Baseball Prospectus Baseball Reference
AL 5.708961 7.473141 4.967914
NL 6.665210 9.242603 5.209278
MLB 6.187852 8.371919 5.085556
Designated Hitter Era
League Fan Graph Baseball Prospectus Baseball Reference
AL 5.498081 7.895890 4.911216
NL 6.511387 9.050952 4.997089
MLB 6.006324 8.470550 4.952662
Wildcard Era
League Fan Graph Baseball Prospectus Baseball Reference
AL 5.954845 6.736583 5.337558
NL 6.516749 8.886120 4.932302
MLB 6.245103 7.896246 5.136775

Final Analysis

Based on our above analysis, it was clear that Baseball Prospectus was the worst predictor of the three, continually performing substantially worse than the other two predictors. Baseball Reference did seem to have an edge over Fan Graph in most of our categories, but Fan Graph did notably have a better relationship in terms of the correlation between predicted and true wins (1.002 compared to Baseball Reference’s 0.986) and also showed less variability to pitcher dominant versus hitter dominant eras.

Based on this, we attempted to build a combined predictor that used all three WAR statistics to create a new WAR statistic that outperformed all three of the previous ones in terms of MSE. We did this using 10-fold cross validation. For each fold, we used the optim function to optimally fit the coefficients based on the data in the other nine folds, then assessing it on the tenth fold (serving as our test set). We average this across our ten folds, getting our final model. The evaluation of this final model on the entire test set returned an MSE of 4.69, outperforming Baseball Reference by nearly 10%! Additionally, in observing the below plot we see that the in the plot of predicted versus true wins, our predictor follows the line more closely and with less variability than all of the other three predictors.

set.seed(123)
k <- 10
#Perform k-fold cross validation
folds <- createFolds(1:nrow(data_wide), k)

#Define function for performing gradient-free optimization
fn <- function(x){
  ref <- x[1]
  pros <- x[2]
  fan <- x[3]
  pred <- ref*data_wide$BRefPred + pros*data_wide$BBproPred + fan*data_wide$FanGraphPred 
  MSE(pred, data_wide$Truth)
}

#Initialize output matrix
MSE_val <- matrix(NA, 4, k)
coeffs <- matrix(NA, 3, k)

for(i in 1:k){
  test_df <- data_wide[folds[[i]],]
  train_df <- data_wide[-folds[[i]],]
  
  m1 <- lm(Truth ~ BRefPred, data = train_df)
  m2 <- lm(Truth ~ BBproPred, data = train_df)
  m3 <- lm(Truth ~ FanGraphPred, data = train_df)
  m4 <- optim(c(1, 1, 1), fn, method = c("Nelder-Mead"))
  
  ## test prediction on test dataset
  pred1 <- predict(m1, newdata = test_df)
  pred2 <- predict(m2, newdata = test_df)
  pred3 <- predict(m3, newdata = test_df)
  
  test_df_lm <- cbind(test_df,
                      BRef_lmpred = pred1,
                      BBpro_lmpred = pred2, 
                      FanGraph_lmpred = pred3)
  
  
  MSE_val[1, i] <- MSE(test_df_lm$Truth, test_df_lm$BRef_lmpred)
  MSE_val[2, i] <- MSE(test_df_lm$Truth, test_df_lm$BBpro_lmpred)
  MSE_val[3, i] <- MSE(test_df_lm$Truth, test_df_lm$FanGraph_lmpred)
  MSE_val[4, i] <- m4$value
  for(j in 1:3){
    coeffs[j, i] <- m4$par[j]
  }
}

rownames(MSE_val) <- c("Baseball Reference", "Baseball Prospectus", "Fangraph", "Combined Model")
apply(MSE_val, 1, mean)
##  Baseball Reference Baseball Prospectus            Fangraph 
##            5.130914            7.658060            5.694803 
##      Combined Model 
##            4.685790
#Obtain coefficients from combined model
final_coef <- apply(coeffs, 1, mean)

#Make predictions based on combined model
data_wide <- data_wide %>%
  mutate(CombinedPred = final_coef[1]*data_wide$BRefPred + final_coef[2]*data_wide$BBproPred + final_coef[3]*data_wide$FanGraphPred )


data_wide %>% ggplot() +
    geom_point(aes(x = Truth, y = FanGraphPred, color="Fan Graph"), alpha = 0.5) + 
    geom_point(aes(x = Truth, y = BBproPred, color="Baseball Prospectus"), alpha = 0.5) + 
    geom_point(aes(x = Truth, y = BRefPred, color="Baseball Reference"), alpha = 0.5) +
    geom_point(aes(x = Truth, y = CombinedPred, color = "Final Model"), alpha = 0.5) +
    scale_fill_manual(values = c("Fan Graph" = rgb(1, 0, 0), "Baseball Prospectus" = rgb(0, 1, 0), "Baseball Reference"=rgb(0, 0, 1), "Final Model" = rgb(0, 0, 0))) +
    geom_abline(slope = 1, intercept = 0) +
    ggtitle("True v Predicted Wins") +
    xlab("True Wins") +
    ylab("Predicted Wins") +
    xlim(40, 125) +
    ylim(40, 125)